Evaluation of an efficient stack-RLE clustering concept for dynamically adaptive grids
This is the author accepted manuscript. The final version is available from the Society for Industrial and Applied Mathematics via the DOI in this record.

Abstract.
One approach to tackle the challenge of efficient implementations for parallel PDE simulations
on dynamically changing grids is the usage of space-filling curves (SFC). While SFC algorithms
possess advantageous properties such as low memory requirements and close-to-optimal partitioning
approaches with linear complexity, they require efficient communication strategies for keeping and
utilizing the connectivity information, in particular for dynamically changing grids. Our approach
is to use a sparse communication graph to store the connectivity information and to transfer data
block-wise. This permits efficient generation of multiple partitions per memory context (denoted
by clustering) which - in combination with a run-length encoding (RLE) - directly leads to elegant
solutions for shared, distributed and hybrid parallelization and allows cluster-based optimizations.
While previous work focused on specific aspects, in this paper we present a compact overall
summary of the stack-RLE clustering approach, complemented by details of the vertex-based
communication that ease understanding of the approach. The central contribution of this work is the proof
of suitability of the stack-RLE clustering approach for an efficient realization of different, relevant
building blocks of scientific computing methodology and real-life CSE applications: We show 95%
strong-scaling efficiency for small-scale benchmarks on 512 cores and over 90% weak-scaling efficiency
on 8192 cores for finite-volume solvers with a grid structure that changes in every time step; optimizations
of simulation data backends by writer tasks; comparisons of analytical benchmarks to analyze the
adaptivity criteria; and a tsunami simulation as a representative real-world showcase of wave propagation
with our approach, which reduces the overall workload by 95% for parallel fully adaptive mesh
refinement and, based on a comparison with SFC-ordered regular grid cells, reduces the computation
time by a factor of 7.6 with improved results and by a factor of 62.2 with results of similar accuracy
for buoy station data.

This work was partly supported by the German Research
Foundation (DFG) as part of the Transregional Collaborative Research Centre “Invasive
Computing” (SFB/TR 89).
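The run-length-encoding idea at the core of the clustering concept can be illustrated with a minimal, hypothetical sketch (not the authors' implementation): consecutive cells in SFC order tend to communicate with the same neighbor partition, so the per-cell list of communication partners compresses well into (partition, run-length) pairs.

```python
# Hypothetical sketch of run-length encoding (RLE) of per-cell
# communication partners along a space-filling-curve (SFC) order.
# Consecutive cells on the SFC often share the same neighbor
# partition, so the adjacency information compresses well.

def rle_encode(partners):
    """Compress a list of neighbor-partition ids into (id, run_length) pairs."""
    runs = []
    for p in partners:
        if runs and runs[-1][0] == p:
            runs[-1] = (p, runs[-1][1] + 1)
        else:
            runs.append((p, 1))
    return runs

def rle_decode(runs):
    """Expand (id, run_length) pairs back into the per-cell list."""
    return [p for p, n in runs for _ in range(n)]

# Cells 0..7 in SFC order and the partition each communicates with:
partners = [2, 2, 2, 5, 5, 5, 5, 3]
runs = rle_encode(partners)
print(runs)  # [(2, 3), (5, 4), (3, 1)]
assert rle_decode(runs) == partners
```

Storing three pairs instead of eight entries is what makes the combination with block-wise transfers cheap, and the compression improves as partitions grow.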
A holistic scalable implementation approach of the lattice Boltzmann method for CPU/GPU heterogeneous clusters
This is the author accepted manuscript. The final version is available from MDPI via the DOI in this record.

Heterogeneous clusters are a widely utilized class of supercomputers assembled from
different types of computing devices, for instance CPUs and GPUs, providing a huge computational
potential. Programming them in a scalable way while exploiting their maximum performance introduces
numerous challenges such as optimizations for different computing devices, dealing with multiple
levels of parallelism, the application of different programming models, work distribution, and hiding
of communication with computation. We utilize the lattice Boltzmann method for fluid flow as
a representative of a scientific computing application and develop a holistic implementation for
large-scale CPU/GPU heterogeneous clusters. We review and combine a set of best practices and
techniques ranging from optimizations for the particular computing devices to the orchestration
of tens of thousands of CPU cores and thousands of GPUs. Ultimately, we arrive at
an implementation using all the available computational resources for the lattice Boltzmann
method operators. Our approach shows excellent scaling behavior, making it future-proof for
heterogeneous clusters of upcoming architectures at the exaFLOPS scale. Parallel efficiencies of
more than 90% are achieved leading to 2,604.72 GLUPS utilizing 24,576 CPU cores and 2,048 GPUs of
the CPU/GPU heterogeneous cluster Piz Daint and computing more than 6.8 · 10⁹ lattice cells.

This work was supported by the German Research Foundation (DFG) as part of the
Transregional Collaborative Research Centre “Invasive Computing” (SFB/TR 89). In addition, this work was
supported by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID d68. We further
thank the Max Planck Computing & Data Facility (MPCDF) and the Global Scientific Information and Computing
Center (GSIC) for providing computational resources.
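The GLUPS figure quoted above (giga lattice-cell updates per second) is the standard throughput metric for lattice Boltzmann codes. A hedged sketch of how GLUPS and weak-scaling efficiency are typically computed follows; the helper names and sample numbers are illustrative and not taken from the paper's measured runs.

```python
def glups(num_cells, num_timesteps, seconds):
    """Lattice-cell updates per second, in units of 10**9 (GLUPS)."""
    return num_cells * num_timesteps / seconds / 1e9

def weak_scaling_efficiency(t_base, t_n):
    """Weak scaling: work per core is fixed as cores increase,
    so the ideal runtime stays constant and efficiency is t_base / t_n."""
    return t_base / t_n

# Illustrative numbers only (not the paper's measured run):
print(glups(1.0e9, 100, 10.0))              # 10.0 GLUPS
print(weak_scaling_efficiency(10.0, 11.0))  # ~0.909, i.e. ~91% efficiency
```

A parallel efficiency above 90%, as reported in the abstract, means the enlarged run on more devices takes barely longer than the baseline despite the added communication.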